outline

 

dissimilarity- and distance-based unsupervised learning

 

independence-based vs association-based

 

taking into account continuous/categorical interactions

 

example and future work

 

dissimilarity- and distance-based unsupervised learning

learning from dissimilarities

some unsupervised learning methods take as input a dissimilarity matrix

 

dimension reduction: multidimensional scaling (MDS)

 

clustering methods: hierarchical (HC) and partitioning around medoids (PAM)

 

the dissimilarity measure of choice is key, obviously
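A minimal R sketch of the three methods fed the same precomputed dissimilarity (here Gower on the `flower` data from the cluster package, an arbitrary illustrative choice):

```r
## MDS, hierarchical clustering and PAM all accept a precomputed dissimilarity
library(cluster)                       # daisy(), pam()

data(flower, package = "cluster")      # small mixed-type example data set
d <- daisy(flower, metric = "gower")   # dissimilarity matrix (Gower)

mds <- cmdscale(d, k = 2)              # dimension reduction: classical MDS
hc  <- hclust(d, method = "average")   # hierarchical clustering
pm  <- pam(d, k = 3)                   # partitioning around medoids

plot(mds, col = pm$clustering, pch = 19)   # PAM clusters on the MDS map
```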

intuition

2 continuous variables: add up by-variable (absolute value or squared) differences


intuition

2 continuous and 1 categorical variables

intuition

one might consider purple and blue closer than e.g. purple and yellow

independence-based

Most commonly used dissimilarity (or distance) measures are based on by-variable differences that are then added together

 

  • in the continuous case: Euclidean or Manhattan distances

 

  • in the categorical case: Hamming (matching) distance (among MANY others)

 

  • in the mixed data case: Gower dissimilarity index

 

no inter-variable relations are considered \(\rightarrow\) independence-based
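A sketch of these independence-based options in base R / cluster (the data set and variable choice are arbitrary):

```r
library(cluster)

X_con <- scale(mtcars[, c("mpg", "hp", "wt")])     # continuous block
d_euc <- dist(X_con, method = "euclidean")         # Euclidean
d_man <- dist(X_con, method = "manhattan")         # Manhattan

X_cat <- data.frame(cyl = factor(mtcars$cyl), am = factor(mtcars$am))
M     <- as.matrix(X_cat)                          # categories as characters
ham   <- function(i, j) mean(M[i, ] != M[j, ])     # Hamming: share of mismatching variables
d_ham <- as.dist(outer(seq_len(nrow(M)), seq_len(nrow(M)), Vectorize(ham)))

d_gow <- daisy(cbind(as.data.frame(X_con), X_cat), metric = "gower")  # mixed case
```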

independence-based

  • When variables are correlated or associated, shared information is effectively counted multiple times

  • the resulting inflated dissimilarities may distort downstream unsupervised learning tasks.

independence-based

The Euclidean distance \(\longrightarrow\) shared information is over-counted

association-based

The Mahalanobis distance \(\longrightarrow\) shared information is not over-counted

this is an association-based distance for continuous data

association-based pairwise distance

  • differences in line with the inter-variables association/correlation are down-weighted

Association-based for continuous: Mahalanobis distance

Let \({\bf X}_{con}\) be an \(n\times Q_{d}\) data matrix of \(n\) observations described by \(Q_{d}\) continuous variables, and let \(\bf S\) be the sample covariance matrix. The Mahalanobis distance matrix is

\[ {\bf D}_{mah} = \left[\operatorname{diag}({\bf G})\,{\bf 1}_{n}^{\sf T} + {\bf 1}_{n}\,\operatorname{diag}({\bf G})^{\sf T} - 2{\bf G}\right]^{\odot 1/2} \] where

  • \([\cdot]^{\odot 1/2}\) denotes the element-wise square root

  • \({\bf G}=({\bf C}{\bf X}_{con}){\bf S}^{-1}({\bf C}{\bf X}_{con})^{\sf T}\) is the Mahalanobis Gram matrix

  • \({\bf C}={\bf I}_{n}-\tfrac{1}{n}{\bf 1}_{n}{\bf 1}_{n}^{\sf T}\) is the centering operator
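A direct R translation of the formula above (a sketch; any continuous data matrix works, iris is used only as an example):

```r
X <- as.matrix(iris[, 1:4])                    # n x Q_d continuous data
n <- nrow(X)
C <- diag(n) - matrix(1 / n, n, n)             # centering operator C
S <- cov(X)                                    # sample covariance matrix S
G <- (C %*% X) %*% solve(S) %*% t(C %*% X)     # Mahalanobis Gram matrix G
g <- diag(G)
D2 <- outer(g, g, "+") - 2 * G                 # squared pairwise distances
D_mah <- sqrt(pmax(D2, 0))                     # element-wise square root
```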


Association-based for categorical: total variation distance (TVD)

The distance matrix \({\bf D}_{tvd}\) is defined using the so-called delta framework, a general way to define categorical data distances.

Let \({\bf X}_{cat}\) be an \(n\times Q_{c}\) data matrix of \(n\) observations described by \(Q_{c}\) categorical variables.

\[ {\bf D} = {\bf Z}{\Delta}{\bf Z}^{\sf T} = \left[\begin{array}{ccc} {\bf Z}_{1} & \dots & {\bf Z}_{Q_{c}} \end{array} \right]\left[\begin{array}{ccc} {\bf\Delta}_1 & & \\ & \ddots &\\ & & {\bf\Delta}_{Q_{c}} \end{array} \right] \left[ \begin{array}{c} {\bf Z}_{1}^{\sf T}\\ \vdots \\ {\bf Z}_{Q_{c}}^{\sf T} \end{array} \right] \]

  • where \({\bf Z}=[{\bf Z}_1,\ldots,{\bf Z}_{Q_c}]\) is the \(n\times Q^{*}\) super-indicator matrix, with \(Q^{*}=\sum_{j=1}^{Q_c} q_j\)

  • \({\Delta}_j\) is the category dissimilarity matrix for variable \(j\), i.e., the \(j\)th diagonal block of the block-diagonal matrix \({\Delta}\).

  • setting \({\Delta}_j\) determines the categorical distance measure of choice (independence- or association-based)
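A sketch of the delta framework on a toy categorical data set; the simple-matching \(\Delta_j\) used here is only a placeholder (the TVD-based choice follows):

```r
X_cat <- data.frame(col  = factor(c("red", "blue", "red", "green")),
                    size = factor(c("S", "M", "M", "S")))

Z_list <- lapply(X_cat, function(x) model.matrix(~ x - 1))  # one indicator block per variable
Z      <- do.call(cbind, Z_list)                            # super-indicator matrix, n x Q*

## placeholder choice: simple matching, delta_j(a, b) = 1 if a != b, 0 otherwise
Delta_list <- lapply(X_cat, function(x) 1 - diag(nlevels(x)))
Delta <- as.matrix(Matrix::bdiag(Delta_list))               # block-diagonal Delta

D_cat <- Z %*% Delta %*% t(Z)                               # here: number of mismatching variables
```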


Association-based for categorical: total variation distance (TVD) (2)

Consider the empirical joint probability distributions stored in the off-diagonal blocks of \({\bf P}\):

\[ {\bf P} = \frac{1}{n} \begin{bmatrix} {\bf Z}_1^{\sf T}{\bf Z}_1 & {\bf Z}_1^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_1^{\sf T}{\bf Z}_{Q_c} \\ \vdots & \vdots & \ddots & \vdots \\ {\bf Z}_{Q_c}^{\sf T}{\bf Z}_1 & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_2 & \cdots & {\bf Z}_{Q_c}^{\sf T}{\bf Z}_{Q_c} \end{bmatrix}. \]

We consider the conditional probability distributions of each variable \(i\) given the categories of each variable \(j\) (\(i,j=1,\ldots,Q_c\), \(i\neq j\)), stored in the block matrix

\[ {\bf R} = {\bf P}_z^{-1}({\bf P} - {\bf P}_z), \]

where \({\bf P}_z = {\bf P} \odot {\bf I}_{Q^*}\), and \({\bf I}_{Q^*}\) is the \(Q^*\times Q^*\) identity matrix.
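In R, both \({\bf P}\) and \({\bf R}\) follow directly from the super-indicator matrix (a sketch on a toy data set; the off-diagonal blocks of \({\bf R}\) hold the conditional proportions, the diagonal blocks are zero):

```r
X_cat <- data.frame(col  = factor(c("red", "blue", "red", "green", "blue")),
                    size = factor(c("S", "M", "M", "S", "S")))
n <- nrow(X_cat)
Z <- do.call(cbind, lapply(X_cat, function(x) model.matrix(~ x - 1)))

P   <- crossprod(Z) / n          # joint proportions, Q* x Q*
P_z <- P * diag(ncol(Z))         # P (Hadamard) I: keep the diagonal only
R   <- solve(P_z) %*% (P - P_z)  # conditional distributions in the off-diagonal blocks
```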


Association-based for categorical: total variation distance (TVD) (3)

Let \({\bf r}^{ji}_a\) and \({\bf r}^{ji}_b\) be the rows of \({\bf R}_{ji}\), the \((j,i)\)th off-diagonal block of \({\bf R}\). The category dissimilarity between \(a\) and \(b\) for variable \(j\) based on the total variation distance (TVD) is defined as

\[ \delta^{j}_{tvd}(a,b) = \sum_{i\neq j}^{Q_c} w_{ji} \Phi^{ji}({\bf r}^{ji}_{a},{\bf r}^{ji}_{b}) = \sum_{i\neq j}^{Q_c} w_{ji} \left[\frac{1}{2}\sum_{\ell=1}^{q_i} |{\bf r}^{ji}_{a\ell}-{\bf r}^{ji}_{b\ell}|\right], \]

where \(w_{ji}=1/(Q_c-1)\) for equal weighting (can be user-defined).

The TVD-based dissimilarity matrix is, therefore,

\[ {\bf D}_{tvd}= {\bf Z}{\Delta}^{(tvd)}{\bf Z}^{\sf T}. \]
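A self-contained sketch of the TVD-based \(\Delta_j\) blocks and of \({\bf D}_{tvd}\), using table()-based conditional proportions and equal weights \(w_{ji}=1/(Q_c-1)\) (the toy data are arbitrary):

```r
X_cat <- data.frame(col  = factor(c("red", "blue", "red", "green", "blue")),
                    size = factor(c("S", "M", "M", "S", "S")))
Qc <- ncol(X_cat)

delta_tvd <- function(j) {                     # Delta_j^{(tvd)} for variable j
  lv <- levels(X_cat[[j]])
  D  <- matrix(0, length(lv), length(lv), dimnames = list(lv, lv))
  for (i in setdiff(seq_len(Qc), j)) {
    R_ji <- prop.table(table(X_cat[[j]], X_cat[[i]]), margin = 1)  # rows: r^{ji}_a
    for (a in lv) for (b in lv)
      D[a, b] <- D[a, b] + 0.5 * sum(abs(R_ji[a, ] - R_ji[b, ])) / (Qc - 1)
  }
  D
}

Z     <- do.call(cbind, lapply(X_cat, function(x) model.matrix(~ x - 1)))
Delta <- as.matrix(Matrix::bdiag(lapply(seq_len(Qc), delta_tvd)))  # block-diagonal Delta^{(tvd)}
D_tvd <- Z %*% Delta %*% t(Z)
```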

AB for mixed?

association-based for mixed

A straightforward AB-distance for mixed data is given by the convex combination of Mahalanobis and TVD distances:

\[ {\bf D}_{mix} =\frac{Q_{d}}{Q}\,{\bf D}_{mah} +\left(1-\frac{Q_{d}}{Q}\right){\bf D}_{tvd}. \]

  • this distance only accounts for correlations or associations among variables of the same type

  • no continuous–categorical interactions are considered.

 

how to measure interactions?

how to measure interactions

define \(\Delta^{int}\), which accounts for the interactions, and use it to augment \(\Delta^{tvd}\)

  • the dissimilarity measure becomes

\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]

where

\[ {\bf D}_{cat}^{(int)}={\bf Z}\tilde{\Delta}{\bf Z}^\top \] and

\[ \tilde{\Delta} = (1-\alpha)\Delta^{tvd} + \alpha \Delta^{int} \] where \(\alpha=\frac{1}{Q_{c}}\).

how to measure interactions

What is \(\Delta^{int}\)?

  • the general entry of the \(j^{th}\) diagonal block, \(\delta_{int}^{j}(a,b)\), accounts for the interaction by measuring how well the continuous variables discriminate between the observations with category \(a\) and those with category \(b\) of the \(j^{th}\) categorical variable
  • consider the computation of \(\delta_{int}^{j}(a,b)\) as a two-class (\(a/b\)) classification problem, with the continuous variables as predictors
    • use a distance-based classifier: nearest neighbors

\(\Delta^{int}_{j}\)

  • consider \({\bf D}_{mah}\) and sort it to identify the neighbors for each observation.

  • set a proportion of neighbors to consider, say \(\hat{\pi}_{nn}=0.1\)

  • for each pair of categories \((a,b)\), \(a,b=1,\ldots,q_{j}\), \(a\neq b\) of the \(j^{th}\) categorical variable:

  • classify the observations using the prior-corrected decision rule

    \[ \text{if $i$ is such that }\ \ \ \frac{\hat{\pi}_{nn}(a)}{\hat{\pi}(a)}\geq\frac{\hat{\pi}_{nn}(b)}{\hat{\pi}(b)} \ \ \ \text{ then assign $i$ to class $a$ else to class $b$} \]

  • compute the balanced accuracy (average of class-wise sensitivities), as in the sketch below \[ \delta_{int}^{j}(a,b)=\frac{1}{2}\left(\frac{\texttt{true } a}{\texttt{true } a + \texttt{false }a}+\frac{\texttt{true } b}{\texttt{true } b + \texttt{false }b}\right) \]
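A hedged sketch of this computation for one category pair; it assumes the two-class problem is restricted to the observations with category \(a\) or \(b\), that \(\hat{\pi}_{nn}(\cdot)\) is the class share among an observation's nearest neighbors, and it follows the "class-wise sensitivities" reading of the balanced accuracy:

```r
## delta_int^j(a, b): prior-corrected NN vote on D_mah, then balanced accuracy
## D_mah: n x n distance matrix on the continuous variables; y: the j-th categorical variable
delta_int_pair <- function(D_mah, y, a, b, pi_nn = 0.1) {
  keep <- y %in% c(a, b)                          # assumption: restrict to the two classes
  D <- D_mah[keep, keep]; y2 <- droplevels(factor(y[keep]))
  n <- length(y2); k <- max(1, ceiling(pi_nn * n))
  prior <- prop.table(table(y2))                  # class priors pi(a), pi(b)

  pred <- vapply(seq_len(n), function(i) {
    nb   <- order(D[i, ])[2:(k + 1)]              # k nearest neighbors (excluding i itself)
    phat <- prop.table(table(y2[nb]))             # neighborhood shares pi_nn(a), pi_nn(b)
    if (unname(phat[a] / prior[a] >= phat[b] / prior[b])) a else b   # prior-corrected rule
  }, character(1))

  sens <- function(cl) mean(pred[y2 == cl] == cl) # class-wise sensitivity
  (sens(a) + sens(b)) / 2                         # balanced accuracy = delta_int^j(a, b)
}
```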

well separated (\(\delta_{int}^{j}(a,b)\) close to 1) or not

Building \(\Delta^{int}_{j}\)

for a categorical variable \(j\) with \(q_{j}\) categories, the quantity \(\delta_{int}^{j}(a,b)\) is computed for each of the \(\frac{q_j(q_j -1)}{2}\) category pairs, progressively filling the symmetric matrix

\[ \Delta_{int} = \begin{pmatrix} 0 & 0.94 & 0.4 & 0.39 \\ 0.94 & 0 & 0.54 & 0.55 \\ 0.4 & 0.54 & 0 & 0 \\ 0.39 & 0.55 & 0 & 0 \end{pmatrix} \]

Just one-way interaction?


Let \({\bf X}=\left[{\bf X}_{con}\ {\bf X}_{cat} \right]\) be a mixed data matrix with \(n\) observations described by \(Q_{d}\) continuous and \(Q_{c}\) categorical variables, respectively, and let \({\bf x}_{i}=\left[{\bf x}_{i_{con}}\ {\bf x}_{i_{cat}}\right]\) denote the \(i^{th}\) observation

We build upon the following results

result A

The distribution of \({\bf x}_{i}\) can be written as

\[ f({\bf x}_{i_{con}},{\bf x}_{i_{cat}})=f({\bf x}_{i_{con}})\,f({\bf x}_{i_{cat}}\mid{\bf x}_{i_{con}}) \]

result B

According to Hennig et al. (2019), starting from an arbitrary distance \(d(x, c_{k})\) from a prototype, it is possible to construct a probabilistic clustering model as \(f(x;c_{k},s_{k})=g(c_{k},s_{k})\exp(-s_{k}\,d(x,c_{k}))\), where \(s_{k}\) is a concentration parameter.

Then, if \(f(\cdot)\sim N({\bf c}_{k}, {\bf S}_{k})\) and \(d(\cdot,\cdot)\) is the Mahalanobis distance, it follows that

\[ f({\bf x}_{i};{\bf c}_k, {\bf S}_k)=g({\bf c}_k, {\bf S}_k)\exp\left[-d({\bf x}_{i},{\bf c}_k)\right] \] where \(g({\bf c}_k, {\bf S}_k)\) is a normalization constant ensuring that \(f({\bf x}_{i};{\bf c}_k, {\bf S}_k)\) is a density function.

Just one-way interaction?

result C

the dissimilarity between \({\bf x}_{i_{con}}\) and a generic cluster \(k\) with center \({\bf c}_k\) and covariance matrix \({\bf S}_k\) is

\[ d({\bf x}_{i_{con}},{\bf c}_k)=\log(M_{k} f({\bf x}_{i_{con}};{\bf c}_k, {\bf S}_k)^{-1}) \]

  • \(f(\cdot)\) a symmetric probability density function

  • \(M_k\) is the maximum of the density function

  • that is, if \({\bf x}_{i_{con}}={\bf c}_k\) then \(d({\bf x}_{i_{con}},{\bf c}_k)=0\).

replacing \({\bf c}_{k}\) with a generic observation \(i'\), the above becomes

\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\log(M f({\bf x}_{i_{con}};{\bf x}_{i'_{con}}, {\bf S})^{-1}) \]

where \(M\) is the maximum of the density function.

Just one-way interaction…

Using result C, it can be shown that

\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T} \]
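One way to see this: with \(M=(2\pi)^{-Q_d/2}\,|{\bf S}|^{-1/2}\), the maximum of the normal density (attained at \({\bf x}_{i_{con}}={\bf x}_{i'_{con}}\)),

\[ d({\bf x}_{i_{con}},{\bf x}_{i'_{con}}) = \log\frac{M}{f({\bf x}_{i_{con}};{\bf x}_{i'_{con}},{\bf S})} = \log\frac{(2\pi)^{-Q_d/2}|{\bf S}|^{-1/2}}{(2\pi)^{-Q_d/2}|{\bf S}|^{-1/2}\exp\!\left[-\tfrac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}){\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}\right]} = \frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}){\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}. \]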

categorical analogue

consider the \(Q_{c}\)-dimensional categorical vector \({\bf x}_{i_{cat}}\); for its \(j^{th}\) element, it holds that

\[ p(x_{ij_{cat}}=a\mid x_{i'j_{cat}}=b) = \left[\delta^{j}(a,b)\right]^{-1} \]

where \(a\) and \(b\) are two generic categories of the \(j^{th}\) variable

For the whole vector, \[ p({\bf x}_{i_{cat}}; {\bf x}_{i'_{cat}}) = \prod_{j=1}^{Q_c} p(x_{ij_{cat}}=a_{j}; x_{i'j_{cat}}=b_{j})=\prod_{j=1}^{Q_c}\left[\delta^{j}(a_{j},b_{j})\right]^{-1} \]

Just one-way interaction… (wrap-up)

using the previous results, it is possible to define the dissimilarity between two mixed-data observations

\({\bf x}_i\) and \({\bf x}_{i'}\)

\[ d({\bf x}_{i}, {\bf x}_{i'})=\frac{1}{2}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}}) {\bf S}^{-1}({\bf x}_{i_{con}}-{\bf x}_{i'_{con}})^{\sf T}-\log(p({\bf x}_{i_{cat}}|{\bf x}_{i_{con}}; {\bf x}_{i'_{cat}}, {\bf x}_{i'_{con}} )) \]

that takes into account correlations, associations and cross-type interactions, and is equivalent to the general entry of the previously defined

\[ {\bf D}_{mix}^{(int)} = {\bf D}_{mah} + {\bf D}_{cat}^{(int)}. \]

spectral clustering in a nutshell

Spectral clustering: a graph partitioning problem

Graph representation

A graph representation of the data matrix \(\bf X\): the aim is then to cut it into \(K\) groups (clusters)

the affinity matrix \({\bf A}\)

the elements \(w_{ij}\) of \(\bf A\) are high (low) if \(i\) and \(j\) are in the same (different) groups; for four nodes \(a,b,c,d\):

\[ {\bf A} = \begin{pmatrix} 0 & 0 & w_{ac} & 0 \\ 0 & 0 & w_{bc} & w_{bd} \\ w_{ac} & w_{bc} & 0 & w_{cd} \\ 0 & w_{bd} & w_{cd} & 0 \end{pmatrix} \]

Spectral clustering: making the graph easy to cut

An approximate solution to the graph partitioning problem:

  • spectral decomposition of the graph Laplacian matrix, a normalized version of the affinity matrix \({\bf A}\):

\[ {\bf L} = {\bf D}_{r}^{-1/2}\,\underbrace{\exp\!\left(-{\bf D}^{\odot 2}(2\sigma^{2})^{-1}\right)}_{\text{affinity matrix } {\bf A}}\,{\bf D}_{r}^{-1/2} = {\bf Q}{\Lambda}{\bf Q}^{\sf T} \]

  • \(\bf D\) is the \(n\times n\) matrix of pairwise distances

  • the \(\sigma\) parameter dictates the number of neighbors each observation is linked to (rule of thumb: median distance to the 20th nearest neighbor)

  • diagonal terms of \(\bf A\) are set to zero: \(a_{ii}=0\) , \(i=1,\ldots,n\)

  • \({\bf D}_{r}=\operatorname{diag}({\bf r})\) , with \({\bf r}={\bf A}{\bf 1}\) and \({\bf 1}\) an \(n\)-dimensional vector of 1’s

  • the spectral clustering of the \(n\) original objects is obtained by applying \(K\)-means to the rows of the matrix \({\bf{\tilde Q}}\), containing the first \(K\) columns of \(\bf Q\) (see the sketch below)
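A compact R sketch of the recipe above (the ring-shaped toy data and the helper name are only illustrative; some variants additionally row-normalize \({\bf{\tilde Q}}\) before \(K\)-means):

```r
spectral_from_dist <- function(D, K, sigma = NULL) {
  n <- nrow(D)
  if (is.null(sigma))                                       # rule of thumb from the slide
    sigma <- median(apply(D, 1, function(d) sort(d)[21]))   # distance to the 20th nearest neighbor
  A <- exp(-D^2 / (2 * sigma^2)); diag(A) <- 0              # affinity matrix, a_ii = 0
  r <- rowSums(A)                                           # r = A 1
  L <- diag(1 / sqrt(r)) %*% A %*% diag(1 / sqrt(r))        # normalized Laplacian
  Q <- eigen(L, symmetric = TRUE)$vectors[, 1:K]            # first K eigenvectors (Q~)
  kmeans(Q, centers = K, nstart = 25)$cluster               # K-means on the rows of Q~
}

## toy usage: two noisy concentric rings (non-convex clusters)
set.seed(1)
th  <- runif(200, 0, 2 * pi); rad <- rep(c(1, 3), each = 100)
X   <- cbind(rad * cos(th), rad * sin(th)) + matrix(rnorm(400, sd = 0.1), ncol = 2)
cl  <- spectral_from_dist(as.matrix(dist(X)), K = 2)
```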

Spectral clustering: solution and performance

SC works well, particularly in the case of non-convex and overlapping clusters

a toy experiment

toy experiment: data generation

is \(\delta_{int}^{j}(a,b)\) of help?

  • three continuous signal variables, three continuous Gaussian noise variables, three categorical variables (one of them noise)

  • different dependence structures and location shifts in the signal continuous variables across the groups defined by the signal categorical variables.

  • the continuous signal variables are generated conditionally on the signal categorical variables

\[ X_j = \beta_{0,nm} + \sum_{k>j}^{6} \beta_{1,nm} X_k, \qquad j = 1,2,3 \] where \(\beta_{0,nm}\) and \(\beta_{1,nm}\) take different values depending on \(n\) and \(m\), the observed categories of the two signal categorical variables, respectively.
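A loose R sketch of a data-generating scheme in this spirit; the coefficient values and group design below are illustrative assumptions, not those of the actual experiment:

```r
set.seed(1)
n   <- 300
g1  <- factor(sample(c("A", "B"), n, replace = TRUE))    # signal categorical (m)
g2  <- factor(sample(c("u", "v"), n, replace = TRUE))    # signal categorical (n)
g3  <- factor(sample(letters[1:3], n, replace = TRUE))   # noise categorical
grp <- as.character(interaction(g1, g2))                 # group label "m.n"

b0 <- c(A.u = 0,   A.v = 2,   B.u = -2,   B.v = 4)       # group-specific intercepts (shifts)
b1 <- c(A.u = 0.2, A.v = 0.8, B.u = -0.5, B.v = 1.2)     # group-specific dependence

## signal continuous variables: group-dependent location and dependence structure
X3 <- unname(b0[grp]) + rnorm(n)
X2 <- unname(b0[grp] + b1[grp] * X3) + rnorm(n)
X1 <- unname(b0[grp] + b1[grp] * (X2 + X3)) + rnorm(n)

noise <- matrix(rnorm(n * 3), n, 3)                      # continuous Gaussian noise
toy   <- data.frame(X1, X2, X3, noise, g1, g2, g3)
```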

toy experiment

toy experiment: compared methods

  • Gower dissimilarity: a straightforward option

  • naive

  • modified Gower

  • association-based approach: with and without interaction

toy experiment

Final considerations and future work

  • association-based measures aim to go beyond match/mismatch of categories

  • when the signal is limited to a few variables, retrieving information from continuous/categorical interactions may be useful

  • measuring interactions via a non-parametric, NN-based approach is suitable for non-convex/oddly shaped clusters

  • computationally demanding (but it can be made bearable)

  • \(\pi_{nn}\) tuning, regularization of the \(\delta_{int}\)’s to reduce variability

an R package to compute distances: manydist! (it’s on CRAN!)

general features

The manydist package is designed around a modular distance-based workflow.

Component | Function / Interface | Role
--- | --- | ---
Core distance constructor | mdist() | Computes a pairwise dissimilarity matrix for mixed-type data; different independence-based and association-based measures can be specified, fully customizable
Recipe step | step_mdist() | Preprocessing step that computes an mdist dissimilarity matrix within a recipe, for seamless integration in learning pipelines
k-nearest neighbors | mdist_knn() | Model specification for k-NN using a precomputed mdist dissimilarity matrix as input; supports tuning
leave-one-variable-out | lovo_mdist() | LOVO dissimilarity computations to assess by-variable contributions
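A minimal usage sketch; the call below assumes mdist()'s defaults handle a mixed-type data frame (check ?mdist after installing the package for the actual arguments), and the penguins data set is only a convenient mixed-type example:

```r
# install.packages("manydist")
library(manydist)
library(palmerpenguins)    # example mixed-type data

X <- na.omit(penguins[, c("bill_length_mm", "flipper_length_mm",
                          "species", "island")])
D <- mdist(X)              # pairwise dissimilarities with the default settings

cl <- cluster::pam(as.dist(D), k = 3)$clustering   # any dissimilarity-based method downstream
```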

forthcoming features

Component | Function / Interface | Role
--- | --- | ---
Partitioning around medoids | mdist_pam() | Model specification for PAM clustering based on mdist dissimilarities
spectral clustering | mdist_spectral() | Model specification for spectral clustering based on mdist dissimilarities

and more to come…

main references

Hennig, C., C. Viroli, and L. Anderlucci (2019). “Quantile-based clustering”. In: Electronic Journal of Statistics 13.2, pp. 4849 - 4883. DOI: 10.1214/19-EJS1640. URL: https://doi.org/10.1214/19-EJS1640.

Le, S. Q. and T. B. Ho (2005a). “An association-based dissimilarity measure for categorical data”. In: Pattern Recognition Letters 26.16, pp. 2549-2557.

Mbuga, F. and C. Tortora (2021a). “Spectral Clustering of Mixed-Type Data”. In: Stats 5.1, pp. 1-11.

Murugesan, N., I. Cho, and C. Tortora (2021). “Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN, and K-Means”. In: Conference of the International Federation of Classification Societies. Springer, pp. 175-185.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2024). “A general framework for implementing distances for categorical variables”. In: Pattern Recognition 153, p. 110547.

Velden, M. van de, A. Iodice D’Enza, A. Markos, et al. (2025a). “A general framework for unbiased mixed-variables distances”. In second-round review at Journal of Computational and Graphical Statistics.

Acknowledgments

The publication was produced with funding from the Italian Ministry of University and Research under the Call for Proposals related to the scrolling of the final rankings of the PRIN 2022 call.

Latent variable models and dimensionality reduction methods for complex data (PI Prof. Paolo Giordani)

Project No. 20224CRB9E, CUP C53C24000730006

Grant Assignment Decree No. 1401 adopted on 18.9.2024 by MUR